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I. Introduction 



Biologically active proteins fold into a native compact structure despite the huge number of possible 
configurations M. Though the mechanism of protein folding is not fully understood, it has been known 
since the re-folding experiments of Anfinsen 0] that globular proteins fold in the absence of any catalytic 
biomolecules. From this fact, it has been established that for proteins, the three dimensional folded structure 
is the minimum free energy structure, and, the information coded in the amino-acid sequence is sufficient 
to determine the native structure ||. The compactness of this unique native state is largely due to the 
existence of an optimal amount of hydrophobic amino-acid residues since these biological objects are 
usually designed to work in water jS| . The relation between the primary one dimensional sequence and the 
final compact three dimensional structure is the task of the protein folding problem. 

In addition to the paradoxical problem of kinetics and time scales of the folding process |]] , there is another 
mystery. If proteins are made randomly by amino acids, the number of all possible such proteins with typical 
length of 100, is far larger than the number of proteins which actually occur in nature. One hypothesis is 
that the naturally selected sequences are special because they are coded for structures that have unique and 
stable native states, allowing for easy folding. Thus a central question of protein evolution is how mutational 
change in the amino acid sequence leads to changes in the structure and stability. 

Some efforts have been made in order to study the stability of proteins against mutation by searching the 
two dimensional configuration space @,|| . One simple model used in these studies is the H-P model |5j . In 
this model there are only two types of chain monomers, hydrophobic (H) and polar (P). Every H-H contact 
between topological neighbours is assigned a negative contact energy, and other contact interactions are set 
to zero. 

Recently Li et al. JlC| ], have looked at this problem in three dimensions. Calculating the energy of all 
possible 27-mers in all compact three dimensional configuration, they have found that, there are a few 
structures, into which a high number of sequences uniquely fold. This structures were named "highly 
designable" and the number of sequences which fold into each state was named its " designability" . In their 



H-P model, they choose the contact energy between H and P monomers by some physical arguments 10 O]. 
Other significant points of their work are: a) Only a few percent of sequences have unique ground state; b) 
There is a jump in energy gap for these highly designable structures. Thus the highly designable structures 
are more stable against mutation and thermal fluctuation. 

Dill and Chan |l2] ] have argued that many of the phenomena observed in proteins can be adequately 
understood in terms of the H-P model, but according to the work of Pande et al. |l3| the designability of a 
conformation does depend on the nature of interactions between monomers. May be any interaction leads 
to some highly designable structures, but different interactions yield different patterns. 

In our work we study this problem for an additive potential. We will show that there are some highly 
designable structures for this potential, but the low designable structures will disappear because of degeneracy 
of ground state. We will show that there is a ladder structure for energy levels for this form of potential. We 
then add a non-additive part to the energy, then the ground state degeneracy of low designable structures 
will be removed. We show there is a critical value for non-additive part of potential, where below this critical 
value the patterns of highly designable structures are fixed, but above this critical value the designability of 
structures is sensitive to the value of non-additive part of the potential. We show that the sequences which 
fold to highly designable structure are sensitive to mutation of some sites. 

An additive potential has the following advantages: 

a) It allows us to prepare a very fast algorithm which is then possible to run on a PC. 

b) It enables us to solve and study some parts of the problem by combinatorial methods. 

c) It gives a clear picture for designability. 

d) A ladder spectrum for the energy levels results, thus it arms us to study the problem for non-additive 
potentials. 
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II. The Model 



We consider an H-P lattice model ||. In this model only non-sequential nearest neighbours interact. 
Because the native structures of proteins are compact with the H type monomers sitting in the core, the 
effective potentials which are usually used, all of the forces are attractive (negative values for potential) and 
the strength of the force between H-H monomers is greater than others. We can write the general form of 
the potential in an arbitrary energy scale as: 

E PP = 0, E HP = -1, E HH = -2- 1 . (1) 



The most usual choice of H-P model potential corresponds to the limit 7 3> 1 |Q-^,^2| , however physical 
arguments are consistent with a smaller value for 7, for instance 7 = 0.3 was used by Li et al. fig] . They 
have calculated the energy of all of 2 27 sequences in 103, 346 compact configuration for a 27-sites cube, by 
a huge enumeration. 

In the case 7 = 0, we have an additive potential. If we let H = —1, and P = 0, we can rewrite the 
potential in the form, 

= <7i + <Jj- (2) 



Following Li et al. |1 
sites of a 3 x 3 x 3 cube 



we consider only compact structures of sequences with length 27, occupying all 
There are 103, 346 compact configurations which are not related to each other 



by rotation and reflection symmetries. Let us call the set of all compact structures, the structure space. 
A protein of length N may be shown by an ./V-component vector 



W) = 



(3) 



where i n = 1,2 refers to P and H residues. Thus the number of such TV-component vectors for proteins with 
length 27 is 2 27 . Let us call the set of \a), the sequence space. 

Because of the additive form of the potential, we can write the energy of a given |er) in any spatial 
configuration as, 

27 

E = Yjpm, (4) 

where p^s are the number of non-sequential neighbours of the ith monomer, or by introducing the neigh- 
bourhood vector \P), 

E = (a\P). (5) 



The vector |P) has 27 components and at ith component has the number of neighbours of the ith monomer. 
Due to the shape of \P) the type of neighbours is not relevant and all we have to do is count the non-sequential 
neighbours. This gives us an additional symmetry for the energy that is different from spatial symmetries. 
For example any of the sites in a two dimensional 5x5 square for two spatial configurations which are shown 
in figs, la and lb, have equal neighbours, but the labels of their neighbours are not the same. Visualisation 
of the same effect in 3 dimensions is a bit harder, but it dose exist. 

The space of all 3 dimensional structures has 103, 347 members for all compact full filled structures in a 
3x3x3 cube. Due to this additional symmetry this space is divided into 6291 subspaces, where all members 
of each subspace have the same \P). Let the number of members of a subspace be, iV^. The range of Nd is 
from 1 to 96. Fig. 2, shows that the frequency of large values of Nd, is low. Interestingly there are a lot of 
P)'s which only point to one structure. 
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FIG. 1. The number of neighbours for corresponding sites in these two configuration are the same, but the 
neighbours are not, for instance, site 18 in (a) is the neighbour of 5, but they are not adjacent in (b). 
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FIG. 2. Histogram of Nd for members of structure space. It is interesting that there are some P sets with Nd = 1. 



We have calculated the energy of all 2 27 |cr) on all \P). We find the degeneracy of ground state in structure 
space. One can see the distribution of number of ground state degeneracies g, for all of 2 27 sequences in fig. 
3. There are only a few sequences which have non-degenerate ground state, this corresponds to the 8.47% 
of sequences at g = 1. If energy of one sequence is minimised in a \P) with N c i greater than one it has 
degenerate ground state. According to definition of designability, such sequences should not be considered. 
The distribution of N s is presented for 7 = in fig. 4. Comparing this figure with fig. 2 of Li et al. JToj ] , 
we observe that there is no similarity. This suggests that designability (N s ), is sensitive to the value of 7, 
which is 7 = in our work, whereas Li et al. chose 7 = 0.3. However as we shall see later, the fact that at 
7 = 0, we have an additive potential plays an important role. In fact a small value of 7 radically change the 
picture. 
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FIG. 3. Histogram of degeneracy of ground state. The sequences which have non-degenerate ground state, 
correspond only to g — 1, in this diagram. 
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FIG. 4. Histogram of N s for additive potential. 



If we consider all of sequences which have non-degenerate ground state in structure space, we get a new 
picture for designability. This means that we calculate the designability of all |P)'s, and not only those with 
Nd = 1. This is contrast to N s which had only Nd — 1. 

To recognise this difference, we show designability of structures, by N' s . Fig. 5 shows the distribution of 
N' s . Many of points in this fig. 5 are related to some |P)'s with Nd ^ 1. We shall use this picture to express 
the nature of the energy gap in the case 7 ^ in section V. 
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FIG. 5. Histogram of N' a for additive potential. Note that many of points in this diagram correspond to some 
\P)'s which point to more than one spatial configuration. 



In our enumeration we have calculated the energy of any sequence in all 6291 |-P)'s, but in fig. 5 , we 
show the results for 3153 |-P)'s which are not related to each other by reverse labelling. We can not reduce 
the structure space according to this symmetry before enumeration. Reverse labelling for a nonsymmetric 
sequence gives two different configurations which may have different energies. 

III. Energy levels 



The number of non-sequential neighbours is related to type of site. A 3 x 3 x 3 cube has 8 corner sites 
(C), 12 link sites (L), 6 face sites (F), and one centre site (O) (fig. 6). C sites have three neighbours, where 
two of them are connected by sequential links and there is only one non-sequential neighbour. Similarly L, 
F and O sites have 2, 3 and 4 non-sequential neighbours respectively. We must add 1 to these numbers for 
two ends of chain. 
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FIG. 6. A 3 x 3 x 3 cube has 8 corner sites, 12 link sites, 6 face sites and 1 centre site. 
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This sites are divided in two classes, {C,F} and {L,0}. In a self avoiding walk in this cube, we must 
jump in any step from one set to other. The first set has 14 members and the second has 13. Thus a walk 
passes through C and F sites in odd steps, and through L and O sites in even steps. In other words, the 
odd components of |P) are 1 or 3, and even components are 2 or 4, (except the 1st and 27th components 
which arc like even components). Thus, 

|P) = bi,...,Pa7>, (6) 

where, 

f 1,3 odd i's; () 
Pl ~ \ 2,4 even i's. [ ' 

Therefore the energy for a sequence a a in a structure P^ is 

Ea^ = (c a |P^) 

= ^ (p/ii ~ l)<7ai + X! (Pv* ~ 2 )' T "' i + X! ' Jal + 2 X! ° a% - ^> 
zGodd iGcvcn iGodd iGeven 

By introducing the new binary variable x the above can be rewritten as, 

27 

E afl = 2Xfj,iagi + ^ (T a j + 2 ^ cr Qi , (9) 

i— 1 iGodd zEeven 

where, 

»i = I ? Pt = I m ? (10) 

II ft = 3or4, v y 

Two last terms in eq. (0) are independent of \X) or |P), thus they result in a constant, which can be ignored 
when comparing energies of a sequence in different configurations. The first term in eq. (^) is an integer 
times two, thus it results in a ladder energy spectrum with gaps of 2. Therefore the energy gap for all of 
structures is the same, and there is no difference between low designable and high designable structures. 

IV. Combinatorial Approach 

Our aim is to find the N' s for any spatial configuration, determined by a vector |P). Because \X) has a 
simpler structure, than |P), we shall use \X) instead of |P). Any vector \X), has seven I's and twenty 0's. 
One of the I's is in the even sites, and the others are in odd sites. Energy could be calculated by performing 
a "logical and" of two binary numbers (|<r) and \X}). For example, a typical |P) is, 

41331111313132 

P. 

2242222222222 

To recognise odd and even components of the vectors, we show them in the above form, writing the even 
sites below. The vector \X) corresponding to the above \P) is, 

10110000101010 

X. 

0010000000000 
On the other hand |er)'s have a similar form: 
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HOHHHO OHOHHOHH 

a. 

OHOHO OHHHO OH 

Where we show P monomers by numerical equivalence of them. Recall that numeric equivalence for H 
monomers is —1. Energy of any sequence in any spatial configuration is calculated by inner product of its 
\a) to corresponding \X). For the above \a) and \X) the energy is 5H. This value is related to exact value 
of energy according to eq.Q by a factor of two and two sequence dependent additional terms, since we 
are interested in the ground state and the energy gap of a sequence, the sequence dependent term may be 
ignored, as structure determines these quantities alone. 

By construction any \X) has six l's in odd sites, and one 1 in even sites. If we don't consider any other 
constraint for \X), we obtain an upper limit for number of |A~)'s. 

n = m x Q 3 ^ = 39039. (11) 

This is far larger than the number of possible \X)'s which we have obtained by enumeration, that is 6291. 
The fact that all 39039 possible configuration don't exist points to extra constraints which are yet to be 
discussed. If all 39039 of A)'s were to exist each of them would have to be unique ground state of only one 
sequence, thus removing all interest! To see this, it is enough to insert an H into \X) where ever one finds 
a 1, and P for zeros. Indeed absence of some of these vectors in real world makes some of the other more 
preferable in nature. 

The connectivity of a self avoiding walk, further constraints the \X). For example to pass through centre 
site, the walk has to pass through two face sites. This means that the only 1 (corresponding to centre site) in 
even sites must be sandwiched between two l's in odd sites (face sites). This constraint reduces the number 
of possible \X)'s. Two l's in odd sites are fixed by even 1, and only 12 sites remain for four other l's. Then 
there are, 

n = Q 2 ) x = 6435, (12) 

vectors. This number is still larger than exact number of \X)s by 144. Although due to our enumeration 
we know these 144 vectors, we can not find the complex constraints which prune them out, and we shall 
continue our calculation as though these 144 vectors were correct. Of course the values are different from 
exact enumeration, however it can be seen that this difference is not too large, and it may be considered as 
an approximation to the exact solution. Also we aid a computer enumeration including the extra 144 vectors 
and have compared the results with the combinatorial calculation. This has served as a check on our code. 

We now proceed to calculate N' s for the following example: 




First let us introduce some new parameters and notations. We will show the energy of a |er) in an \X) as: 

E = E(a, b,c) = (a + b + c)H, (13) 

where a, b and c are related to energy parts which come from centre (1 in lower row), faces which are 
connected to centre, and energy of other parts, respectively. For example energy of following \a): 

H0HH0H0 0HH0 OHH 
OHO OHHHO HO OHO 

in \X ), is £(0,2,3) = 5H. 
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Besides, we name the number of pairs of l's in the upper row of \X) as z and the number of l's in two 
ends of vectors as y. For \Xq), z = 2, and y = 1. 

Now we try to count the number of all polymers which have their energy minimised in \Xq) and, there is 
no other \X) with energy equal to ground state for them. To do this we discuss all possible cases. 

i: £7(1,2,4) 

Such polymers have at least seven H sites corresponding to l's of Xq. These polymers have minimum 
possible energy, thus Xq is a minimum energy configuration for them. But it must be checked whether it is 
a unique ground state or not. First consider polymers which in addition to these seven H's have another H 
monomer in their upper row sites, 



HOHHHO 0H0H0H0 

00H0000000000 
The energy of this sequence in following \X) is 7H too. 




Then the ground state of polymers which have additional H monomers in corresponding to upper row O's of 
Xq, is degenerate, and they don't count in N' s of Xq. The above discussion is independent of value of a and 
b in E(a, b, 4), and degrees of freedom to choose sites for H monomers is limited to lower row sites. 

For the \X) with z^l (like Xq) polymers can not have H monomers in the lower sites between two upper 
row l's. For example, the following sequence, 



HOHHHO OHO OHO 

02, 

00HH000000000 

has energy 7H in following \X) too. 




Then the contribution of polymers with £'(1,2,4) in N' s is: 

N' s {i) = 2 12 -( z -!) = 2 13 ~ z (14) 

ii: £(0,2,4) 

In this case if z > 1 (such as Xq) the ground state is degenerate. It can be seen that any sequence with 
energy £ = (0, 2, 4) in Xq state has the same energy in Xi state. In the case z = 1, only the sites in lower 
row by condition that they are not a neighbour of corresponding upper l's of X, have freedom to be an H 
or P monomer. There are 2 x 6 — z — y sites which don't have this freedom in lower row. Then, 

f 2 2 -y z - 1 

KW = {1 z z l\ (is) 
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iii: £(1,0,4) 

In this case there is only one sequence with nondegenerate ground state. For our example, X , this 
sequence is, 



H000H000H000H0 
00H0000000000 

In the above sequence changing any P monomer to H type, will cause the ground state to becomes 
degenerate. Then, 

Nftii) = 1 (16) 

iv: £(1,1,4) 

For this case b is 1, and if this 1 comes from right or left neighbour of lower 1, it has different solutions. 
Then we introduce new parameters (zr, ur) and (zl, Ul), which are similar to old z and y, when right or 
left neighbour 1 of lower 1 will be omitted. For Xq we have zr = 0, Zl = 1 arid ur = ul = y = 1- By 
introducing this new parameters this case is very similar to case ii, and the difference comes from number 
of corresponding l's in upper row (five instead six), and no restriction in value of z. Then, 

A^(iv) = 2 3+ZR+yR + 2 3+ZL+yL . (17) 

v: Other cases 

All of the other cases for ground state energy are degenerate, and need not be considered. 
With this analysis it is possible to find N' s for any \X). For our X example it is, 

= 2 11 
N' s (ii) = 
N' a (ui) = 1 
N' S (W) = 2 4 + 2 5 , 

that gives, 

N' S (X ) = 2097 

In this way all of the values of N'^s can be calculated. Had the 144 additional structures been taken out, the 
calculation of N' s for the problem would correspond to enumeration exactly. However taking these structures 
out is too complex and would have to be done case by case. Besides of the value of N' s , this calculation 
shows that the sequences whit non-degenerate ground state have between 4 to 6 H type monomers in face 
sites and no one in corner sites. Indeed in our model the stability of polymers is very sensitive to mutation 
in corner sites. 

V. Nonadditive potentials 

In the case 7^0 the potential is non-additive. In this case we can write the energy of ath sequence in 
/xth spatial configuration as: 

E^ = {a a \P^)-\ 1 {a a \M IJ \a a ). (18) 

Where a and P are the sequence and neighbourhood vectors, that introduced in previous sections. M is the 
adjacency matrix for this configuration. 
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Mi. 



1 if the ith and jth monomers are adjacent; 
otherwise. 



(19) 



Any \P) has Nd different M-matrices. The first term in eq. (|18|) was calculated in the case 7 = 0, and 
we need calculate only the last part. The aim of our calculation is to find the ground state. In any compact 
configuration in a 3 x 3 x 3 cube, there are 28 non-sequential neighbour pairs. Thus the contribution of the 
last term in energy is less than 287. We have shown that energy spectrum for the previous case has a ladder 
structure with energy gap equal to 2. In this case these split to some sublevels (fig. 7). Then if we choose 
7 < the levels are separate. Of course this is a lower estimation for 7. In the next section we will obtain 
a better estimate for lower and upper limits of the critical value of 7. 
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FIG. 7. Energy levels of additive potential split to sublevels for non-additive potential. 



From the result of the additive potential we have a subset P in the space of all spatial configurations 
which gives the minimum energy to folding. This P subset has Nd members which all of them have the same 
\P). For small 7's the ground state and the first excited state are between these Nd structures, and it is not 
necessary to calculate the energy for all of 103, 346 spatial structures for any sequence, except for sequences 
which their ground state is in structures with Nd = 1. For The Nd = 1 structures the value of N s does not 
change, and it is not necessary to run the program. The first excited state of these sequences are in an other 
P subset. Thus to find the energy gap for them the program must be run over all of the 103, 346 structures. 
We have calculated this energy spectrum, and have found the new N s for all 103, 346 structures. We show 
the results for 51,704 configuration which are unrelated by reverse labelling symmetry in fig. 8. We have 
found the energy gap for first excited state for all sequences. You can see the diagram of mean of energy 
gap vs. N s in fig. 9. This figure shows that highly designable structures which are related to P subsets with 
one member. 
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FIG. 8. Histogram of N„ for non-additive potential. 
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FIG. 9. The mean of energy gap vs. N s . There is a jump in energy gap for highly designable structures. All of 
these highly designable structures have N d = 1. 



In this enumeration we have calculated the energy spectrum for all of the sequences which have nondegen- 
erate ground state for the additive potential. We had removed some of the sequences because of degeneracy 
of ground state in additive potential case. It is possible that this degeneracy will be removed by the non- 
additive part of the potential, and some of the sequences have unique ground state for non-additive potential. 
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But the energy gap for these sequences is of order of 7, and if we consider them it causes a shift in horizontal 
axes to bigger N s and bring down the points nearer to 7 value in vertical direction in fig. 9. These make 
this figure more similar to results of Li et al. |nj . In their work the energy gap for low designable structures 
are of order of 7 ( they choose 7 = 0.3) also. 

VI. Estimation of 7 C 

The energy levels for the additive potential have a ladder structure, as it had been proven in previous 
sections. The energy gaps between the levels is 2 in our arbitrary energy unit. 

In the case of 7 ^ the energy has two parts (eq. [l8|). The first part comes from additive part of potential 
and does not change. Second part comes from non-additive part of potential, and is equal to number of H-H 
non-sequential neighbours in spatial configuration. Because this non-additive part, the energy spectrum is 
changed, and any level is splited to some sublevels (fig. 7). 

If contribution of the second part to energy is less than 2, for all structures, the ground state and excited 
state of any polymer is between \P) partners of its ground state for additive potential , except for \P) with 
Nd = 1, where there is only ground state. 

Let Seo be the difference of ground state energies of additive and non-additive potentials, and 5e\ be 
difference energy of first excited state in the case of 7 = with minimum of new energies for the sequence 
in the structures corresponding to these excited states (there is no uniqueness constraint for excited states). 
If Seo — Se± < 2 the ground state doesn't change and the values of N s for structures that we presented in 
past section don't change. By increasing 7, the absolute values of Seo and 5e% increase. 

To find the difference between Seg and 8e\ one have to calculate the difference in H-H contacts in ground 
state structure and maximum of H-H contacts in excited level structures. 

This difference has two sources. Because the energy levels in the case 7 = are separated by 2, then 
difference of them comes from replacing a H monomer from O site to an L site, or from an F site to a C site. 
Both of them cause increasing in energy by 2. But it is possible that these replacing decrease the energy by 
27. For example consider one F site with no H neighbour will go to one C site with two non-sequential H 
neighbours (this monomer must be an end residue), then this gives an upper limit for j c , that is 1. 

The other source for increasing the H-H contacts, comes from replacing H monomers in L and F sites by 
the same type sites. These changes only are relevant in the case 7 7^ 0. The maximum of increasing in H-H 
contacts because these replacing are 67, related to the sequences which have 4 H monomers in the F sites 
and 5 to 7 in L sites in their ground state structures. Thus the lower limit for 7 C is = 0.25. Therefore, 

0.25 < 7 C < 1. (20) 

This shows that there is a non zero value for j c , which for 7 less than it, the ground state structure of 
sequences doesn't change. Indeed 7 C distinguishes two phases. If 7 < -f c , the degree of designability of 
structures is independent of 7, and the change in value of 7 only changes the energy gaps. On the other 
hand for 7 > j c , the designability of structures becomes sensitive to the value of 7, and the patterns of 
highly designable structures will be changed if the potential changes. 

If the designability is the answer of "why has the nature selected a small fraction of possible configurations 
for folded states?", the above discussion shows that this selection is potential independent if 7 < j c , and 
sensitive to inter monomers interactions if 7 > j c . 
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